NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

ExpertAF: Expert Actionable Feedback from Video

https://doi.org/10.1109/CVPR52734.2025.01268

Ashutosh, Kumar; Nagarajan, Tushar; Pavlakos, Georgios; Kitani, Kris; Grauman, Kristen (June 2025, IEEE)

Feedback is essential for learning a new skill or improving one’s current skill-level. However, current methods for skill-assessment from video only provide scores or compare demonstrations, leaving the burden of knowing what to do differently on the user. We introduce a novel method to generate actionable feedback (AF) from video of a person doing a physical activity, such as basketball or soccer. Our method takes a video demonstration and its accompanying 3D body pose and generates (1) free-form expert commentary describing what the person is doing well and what they could improve, and (2) a visual expert demonstration that incorporates the required corrections. We show how to leverage Ego-Exo4D’s [29] videos of skilled activity and expert commentary together with a strong language model to create a weakly-supervised training dataset for this task, and we devise a multimodal video-language model to infer coaching feedback. Our method is able to reason across multi-modal input combinations to output fullspectrum, actionable coaching—expert commentary, expert video retrieval, and expert pose generation—outperforming strong vision-language models on both established metrics and human preference studies.
more » « less
Full Text Available
4DIFF: 3D-Aware Diffusion Model for Third-to-First Viewpoint Translation

Cheng, Feng; Luo, Mi; Wang, Huiyu; Dimakis, Alex; Torresani, Lorenzo; Bertasius, Gedas; Grauman, Kristen (May 2025, ECCV '24 https://openreview.net/forum?id=nReyoIseTD)

Abstract. We present 4Diff, a 3D-aware diffusion model addressing the exo-to-ego viewpoint translation task—generating first-person (egocentric) view images from the corresponding third-person (exocentric) images. Building on the diffusion model’s ability to generate photorealistic images, we propose a transformer-based diffusion model that incorporates geometry priors through two mechanisms: (i) egocentric point cloud rasterization and (ii) 3D-aware rotary cross-attention. Egocentric point cloud rasterization converts the input exocentric image into an egocentric layout, which is subsequently used by a diffusion image transformer. As a component of the diffusion transformer’s denoiser block, the 3D-aware rotary cross-attention further incorporates 3D information and semantic features from the source exocentric view. Our 4Diff achieves state-of-the-art results on the challenging and diverse Ego-Exo4D multiview dataset and exhibits robust generalization to novel environments not encountered during training. Our code, processed data, and pretrained models are publicly available at https://klauscc.github.io/4diff.
more » « less
Full Text Available
HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness

Xue, Zihui; Luo, Mi; Chen, Changan; Grauman, Kristen (November 2024, https://doi.org/10.48550/arXiv.2406.07754)

This paper addresses the challenge of precisely swapping objects in videos, particularly those involved in hand-object interactions (HOI), using a single user-provided reference object image. While diffusion models have advanced video editing, they struggle with the complexities of HOI, often failing to generate realistic edits when object swaps involve changes in shape or functionality. To overcome this, the authors propose HOI-Swap, a novel diffusion-based video editing framework trained in a self-supervised manner. The framework operates in two stages: (1) single-frame object swapping with HOI awareness, where the model learns to adjust interaction patterns (e.g., hand grasp) based on object property changes; and (2) sequence-wide extension, where motion alignment is achieved by warping a sequence from the edited frame using sampled motion points and conditioning generation on the warped sequence. Extensive qualitative and quantitative evaluations demonstrate that HOI-Swap significantly outperforms prior methods, producing high-quality, realistic HOI video edits.
more » « less
Full Text Available
ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling

Somayazulu, Arjun; Majumder, Sagnik; Chen, Changan; Grauman, Kristen (August 2024, International Conference on Intelligent Robots and Systems (IROS))

Full Text Available
Sim2Real Transfer for Audio-Visual Navigation with Frequency-Adaptive Acoustic Field Prediction

Chen, Changan; Ramos, Jordi; Tomar, Anshul; Grauman, Kristen (August 2024, International Conference on Intelligent Robots and Systems (IROS))

Full Text Available
Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos

Majumder, Sagnik; Al-Halah, Ziad; Grauman, Kristen (June 2024, IEEE Conference on Computer Vision and Pattern Recognition (CVPR))

Full Text Available
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Chen, Changan; Peng, Puyuan; Baid, Ami; Xue, Zihui; Hsu, Wei-Ning; Harwath, David; Grauman, Kristen (July 2024, https://doi.org/10.48550/arXiv.2406.09272)

Generating realistic audio for human actions is critical for applications such as film sound effects and virtual reality games. Existing methods assume complete correspondence between video and audio during training, but in real-world settings, many sounds occur off-screen or weakly correspond to visuals, leading to uncontrolled ambient sounds or hallucinations at test time. This paper introduces AV-LDM, a novel ambient-aware audio generation model that disentangles foreground action sounds from ambient background noise in in-the-wild training videos. The approach leverages a retrieval-augmented generation framework to synthesize audio that aligns both semantically and temporally with the visual input. Trained and evaluated on Ego4D and EPIC-KITCHENS datasets, along with the newly introduced Ego4D-Sounds dataset (1.2M curated clips with action-audio correspondence), the model outperforms prior methods, enables controllable ambient sound generation, and shows promise for generalization to synthetic video game clips. This work is the first to emphasize faithful video-to-audio generation focused on observed visual content despite noisy, uncurated training data.
more » « less
Full Text Available
Institute for Foundations of Machine Learning (IFML): Advancing AI systems that will transform our world

https://doi.org/10.1002/aaai.12163

Klivans, Adam; Dimakis, Alexandros G.; Grauman, Kristen; Tamir, Jonathan I.; Diaz, Daniel J.; Davidson, Karen (March 2024, AI Magazine)

Abstract The Institute for Foundations of Machine Learning (IFML) focuses on core foundational tools to power the next generation of machine learning models. Its research underpins the algorithms and data sets that make generative artificial intelligence (AI) more accurate and reliable. Headquartered at The University of Texas at Austin, IFML researchers collaborate across an ecosystem that spans University of Washington, Stanford, UCLA, Microsoft Research, the Santa Fe Institute, and Wichita State University. Over the past year, we have witnessed incredible breakthroughs in AI on topics that are at the heart of IFML's agenda, such as foundation models, LLMs, fine‐tuning, and diffusion with game‐changing applications influencing almost every area of science and technology. In this article, we seek to highlight seek to highlight the application of foundational machine learning research on key use‐inspired topics:Fairness in Imaging with Deep Learning: designing the correct metrics and algorithms to make deep networks less biased.Deep proteins: using foundational machine learning techniques to advance protein engineering and launch a biomanufacturing revolution.Sounds and Space for Audio‐Visual Learning: building agents capable of audio‐visual navigation in complex 3D environments via new data augmentations.Improving Speed and Robustness of Magnetic Resonance Imaging: using deep learning algorithms to develop fast and robust MRI methods for clinical diagnostic imaging.IFML is also responding to explosive industry demand for an AI‐capable workforce. We have launched an accessible, affordable, and scalable new degree program—the MSAI—that looks to wholly reshape the AI/ML workforce pipeline.
more » « less
Full Text Available
Learning Audio-Visual Dereverberation

https://doi.org/10.1109/ICASSP49357.2023.10095818

Chen, Changan; Sun, Wei; Harwath, David; Grauman, Kristen (June 2023, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))

Full Text Available
NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory

Ramakrishnan, Santhosh; Al-Halah, Ziad; Grauman, Kristen (January 2023, IEEE Conference on Computer Vision and Pattern Recognition)

Full Text Available

« Prev Next »

Search for: All records